Abstract: Large graphs are difficult to represent, visualize, 
and understand. In this paper, we introduce "gate graph" - a new 
approach to perform graph simplification. A gate graph provides 
a simplified topological view of the original graph. Specifically, 
we construct a gate graph from a large graph so that for any 
"non-local" vertex pair (distance higher than some threshold) in 
the original graph, their shortest-path distance can be recovered 
by consecutive "local" walks through the gate vertices in the 
gate graph. We perform a theoretical investigation on the gate- 
vertex set discovery problem. We characterize its computational 
complexity and reveal the upper bound of minimum gate-vertex 
set using VC-dimension theory. We propose an efficient mining 
algorithm to discover a gate-vertex set with guaranteed logarith- 
mic bound. We further present a fast technique for pruning 
redundant edges in a gate graph. The detailed experimental 
results using both real and synthetic graphs demonstrate the 
effectiveness and efficiency of our approach. 



I. Introduction 

Reducing graph complexity, or graph simplification, is becoming
an increasingly important research topic [1], [32], [37], [34], [30].
It can be very challenging to grasp a graph
even with thousands of vertices. Graph simplification aims at
reducing edges, vertices, or extracting a high-level abstraction
of the original graph so that the overall complexity of the graph
is lowered while certain essential properties of the graph can
still be maintained. It has been shown that such simplification
can help understand the underlying structure of the graph [13],
H]; better visualize graph topology [26], [13]; and speed up
graph computations [2], [23], [7], [19], [33], [30].

In this paper, we investigate how to extract a set of vertices 
from a graph such that the vertex locations and relationships 
not only help to preserve the distance measure of the original 
graph, but also provide a simplified topological view of the 
entire graph. Intuitively, these vertices can be considered to 
distribute rather "evenly" in the graphs in order to reflect 
its overall topological structure. For any "non-local" vertex 
pair (distance higher than some threshold), their shortest- 
path distance can be recovered by consecutive "local" walks 
through these vertices. Basically, these vertices can be viewed 
as the key intermediate highlights of the long-range (shortest- 
path distance) connections in the entire graph. In other words, 
for any vertex to travel to another vertex beyond its local range, 
it can always use a sequence of these discovered vertices (each 
one being in the local range of its predecessor) to recover its 
shortest path distance to the destination. Thus, conceptually,
this set of vertices forms a "wrap" surrounding any vertex in
the original graph, so that any long-range (shortest-path) traffic
goes through the "wrap". From this perspective, these vertices
are referred to as the gate vertices and our problem is referred
to as the gate-vertex set discovery problem. Furthermore, these
gate vertices can be connected together using only "local"
links to form a gate graph. A gate graph not only reveals
the underlying highway structure, but also can serve as a
simplified view of the entire graph. The gate-vertex set and gate
graph have many applications in graph visualization [14], [5]
and shortest-path distance computation [33].

A. Problem Definition 

Let $G = (V, E)$ be an unweighted and undirected graph,
where $V = \{1, 2, \ldots, N\}$ is the vertex set and $E \subseteq V \times V$ is
the edge set of graph $G$. We use $(u, v)$ to denote the edge from
vertex $u$ to vertex $v$, and $P_{v_0,v_p} = (v_0, v_1, \ldots, v_p)$ to denote
a simple path from vertex $v_0$ to vertex $v_p$. The length of a
simple path in an unweighted graph is the number of edges in
the path. For a weighted graph, each edge $e \in E$ is assigned a
weight $w(e)$. The length of a simple path in a weighted graph
is the sum of the weights of the edges in the path. The distance
from vertex $u$ to vertex $v$ in the graph $G$ is denoted as $d(u, v)$,
which is the minimal length of all paths from $u$ to $v$.

Given a user-defined threshold $\epsilon > 0$, for any pair of
connected vertices $u$ and $v$, if their distance is strictly less than
$\epsilon$ ($d(u, v) < \epsilon$), we refer to them as a local pair, and their
distance is referred to as a local distance; if their distance
is greater than or equal to $\epsilon$ but finite, we refer to them as
a non-local pair, and their distance is referred to as a non-local
distance. In addition, we also refer to $\epsilon$ as the locality
parameter or the granularity parameter.
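The local/non-local distinction can be illustrated with a minimal Python sketch. It assumes the graph is stored as an adjacency dictionary and uses breadth-first search for hop distances; the names `bfs_distances` and `classify_pair` are illustrative and do not come from the paper.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from `source` in an unweighted graph {vertex: [neighbors]}."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def classify_pair(adj, u, v, eps):
    """Label (u, v) as local (d < eps), non-local (d >= eps), or disconnected."""
    d = bfs_distances(adj, u).get(v)
    if d is None:
        return "disconnected"
    return "local" if d < eps else "non-local"

# Illustrative graph: the path 0-1-2-3-4 plus an isolated vertex 5.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3], 5: []}
print(classify_pair(path, 0, 2, eps=3))
print(classify_pair(path, 0, 3, eps=3))
```

With the locality parameter set to 3, the pair at distance 2 is local and the pair at distance 3 is non-local; the isolated vertex forms neither kind of pair with any other vertex.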

Definition 1: (Minimum Gate-Vertex Set Discovery
(MGS) Problem) Given an unweighted and undirected graph
$G = (V, E)$ and a user-defined threshold $\epsilon > 0$, a vertex set
$V^* \subseteq V$ is called a gate-vertex set if $V^*$ satisfies the following
property: for any non-local pair $u$ and $v$ ($d(u, v) \geq \epsilon$), there
is a vertex sequence formed by consecutive local pairs from
$u$ to $v$, $(u, v_1, v_2, \ldots, v_k, v)$, where $v_1, v_2, \ldots, v_k \in V^*$, such
that $d(u, v_1) < \epsilon$, $d(v_1, v_2) < \epsilon$, $\ldots$, $d(v_k, v) < \epsilon$, and
$d(u, v_1) + d(v_1, v_2) + \cdots + d(v_k, v) = d(u, v)$. The gate-vertex
set discovery problem seeks a set of gate vertices with the smallest
cardinality.

In other words, the gate-vertex set guarantees that the
distance between any non-local pair $u$ and $v$ can be recovered
using the distances from the source vertex $u$ to a gate vertex
$v_1$, between consecutive gate vertices, and from the last gate
vertex $v_k$ to the destination vertex $v$. These are all local
distances. Here, the local distance requirement for recovering
any non-local distance enables the gate vertices to reflect
enough details of the underlying topology of the original graph
$G$. Based on the gate-vertex set, we can further define the gate
graph.
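On small graphs, the gate-vertex set property of Definition 1 can be checked by brute force. The sketch below is an illustration under stated assumptions, not the paper's algorithm: it runs a Dijkstra-style search in which every hop must be a local pair and every intermediate stop must be a gate vertex, then compares the recovered length with the true distance. The helper names are hypothetical.

```python
import heapq
from collections import deque

def all_pairs_distances(adj):
    """BFS from every vertex of an unweighted graph {vertex: [neighbors]}."""
    dist = {}
    for s in adj:
        d = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in d:
                    d[v] = d[u] + 1
                    queue.append(v)
        dist[s] = d
    return dist

def recovered_distance(dist, gates, u, v, eps):
    """Shortest walk u -> gate -> ... -> gate -> v using only local hops (< eps)."""
    stops = set(gates) | {v}
    best = {u: 0}
    heap = [(0, u)]
    while heap:
        d, x = heapq.heappop(heap)
        if x == v:
            return d
        if d > best[x]:
            continue
        for y in stops:
            hop = dist[x].get(y)
            if y == x or hop is None or hop >= eps:
                continue
            if d + hop < best.get(y, float("inf")):
                best[y] = d + hop
                heapq.heappush(heap, (d + hop, y))
    return None  # no sequence of local hops reaches v

def is_gate_vertex_set(adj, gates, eps):
    """True if every non-local distance is recovered exactly through the gates."""
    dist = all_pairs_distances(adj)
    for u in adj:
        for v, d in dist[u].items():
            if d >= eps and recovered_distance(dist, gates, u, v, eps) != d:
                return False
    return True

# Illustrative graph: the path 0-1-...-6.
path7 = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 6] for i in range(7)}
print(is_gate_vertex_set(path7, {2, 4}, eps=3))
print(is_gate_vertex_set(path7, {3}, eps=3))
```

On this path with locality parameter 3, the set {2, 4} passes the check, while {3} alone fails because vertex 0 has no gate vertex within local range.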

Definition 2: (Minimum Gate Graph Discovery (MGG)
Problem) Given an unweighted and undirected graph $G =
(V, E)$ and a gate-vertex set $V^*$ ($V^* \subseteq V$) with respect
to parameter $\epsilon$, the gate graph $G^* = (V^*, E^*, W)$ is any
weighted and undirected graph, where $W$ assigns each edge
$e \in E^*$ a weight $w(e)$, such that for any non-local pair $u$ and
$v$ in $G$ ($d(u, v) \geq \epsilon$), we have

$$d(u, v) = \min_{x, y \in V^*,\ d(u, x) < \epsilon,\ d(y, v) < \epsilon} \big( d(u, x) + d(x, y \mid G^*) + d(y, v) \big).$$

Here $d(x, y \mid G^*)$ is the distance between $x$ and $y$ in the
weighted gate graph. The gate graph discovery problem seeks
the gate graph with the minimum number of edges. Note that
the edges in the gate graph may not belong to the original
graph.
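The recovery formula in Definition 2 can be evaluated directly on a toy instance. The following sketch assumes the gate graph is given as a dictionary of weighted edges; the function names and the example gate graph are illustrative assumptions rather than anything specified in the paper.

```python
import heapq
from collections import deque

def bfs_distances(adj, source):
    """Hop distances in the original unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def gate_graph_distances(gate_edges, gates, source):
    """Dijkstra on the weighted gate graph, given as {(x, y): weight}."""
    nbrs = {g: [] for g in gates}
    for (x, y), w in gate_edges.items():
        nbrs[x].append((y, w))
        nbrs[y].append((x, w))
    best = {source: 0}
    heap = [(0, source)]
    while heap:
        d, x = heapq.heappop(heap)
        if d > best[x]:
            continue
        for y, w in nbrs[x]:
            if d + w < best.get(y, float("inf")):
                best[y] = d + w
                heapq.heappush(heap, (d + w, y))
    return best

def recover_via_gate_graph(adj, gates, gate_edges, u, v, eps):
    """min over gates x, y with d(u,x) < eps and d(y,v) < eps of
    d(u,x) + d(x,y | G*) + d(y,v); returns None if no such x, y exist."""
    du, dv = bfs_distances(adj, u), bfs_distances(adj, v)
    best = None
    for x in gates:
        if du.get(x, eps) >= eps:
            continue  # x is not within local range of u
        for y, dxy in gate_graph_distances(gate_edges, gates, x).items():
            if dv.get(y, eps) < eps:
                total = du[x] + dxy + dv[y]
                if best is None or total < best:
                    best = total
    return best

# Illustrative instance: path 0-1-...-6, gates {2, 4}, one gate edge of weight 2.
path7 = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 6] for i in range(7)}
gates, gate_edges = [2, 4], {(2, 4): 2}
print(recover_via_gate_graph(path7, gates, gate_edges, 0, 6, eps=3))
```

Here the non-local distance between the two endpoints of the path is recovered as a local hop to gate 2, one gate-graph edge, and a local hop from gate 4.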

Our Contributions: 

1) We introduce and formally define the new gate-vertex set
and gate graph discovery problems, which are applicable to
numerous graph mining tasks;

2) Based on basic properties of gate vertices, we perform a
theoretical study on the gate-vertex set by connecting it to the
theory of VC-dimension, and prove the NP-hardness of the minimum
gate-vertex set discovery problem;

3) We develop an efficient mining algorithm based on the
set-cover framework to discover the gate-vertex set with a
guaranteed logarithmic approximation bound. We discuss a fast
approach to prune redundant edges in the gate graph;

4) We perform a detailed experimental evaluation using both
real and synthetic graphs. Our results demonstrate the
effectiveness and efficiency of our approach.

II. Related Problems and Work 

Graph Simplification: Our work on discovering a gate-vertex
set and gate graph can be categorized as graph simplification
with a focus on preserving the shortest-path distance measure. The
most intuitive graph simplification method is graph clustering
HI or decomposition [25], which provides a high-level
view of the graph. However, this approach mainly focuses
on discovering the community structure of the graph, and its
representation is generally too coarse to preserve much other
essential information of the graph (such as the connectivity
and shortest-path distance measure). Several recent efforts
study how to simplify a graph while maintaining its key
properties, such as the effective resistance [32],
connectivity [37], and other path-oriented measures [34]. However,
in these studies, the simplified graph is a spanning subgraph
of the original graph, and thus does not reduce the overall
scale of the graph in terms of the number of vertices. In our
work, we instead focus on discovering a subset of essential
vertices which can maximally recover the all-pair shortest-path
distances with respect to the locality parameter $\epsilon$.

In order to better visualize a large graph, the visualization
community has proposed several methods to simplify graphs.
For instance, the authors of [26] consider sampling a subgraph
from the original graph for visualization, and in [13], the
authors develop a pruning framework to remove unimportant
vertices in terms of their betweenness and other
distance-related measures. These methods are in general
heuristic and cannot provide a quantitative guarantee on how
well the graph properties are preserved. Our gate-vertex set
and gate graph provide a new means to visualize large graphs
and assist distance-centered graph visualization and analysis.

Finally, several works 0, [23], EO) study how to
extract a concise subgraph which can best describe the
relationship between a pair or a set of vertices in terms of electric
conductance lU, [35] or network reliability ITSll, [20]. Our
goal is to depict the shortest-path distances using gate vertices
and the gate graph.

Shortest-Path Distance Computation: Computing the
shortest-path distance is a fundamental task in graph mining and
management. Many important graph properties, such as graph
diameter, betweenness centrality and closeness centrality, are
all highly dependent on distance computation. Even though
the BFS approach for computing pair-wise distance is quite
efficient for small graphs, it is very expensive for large graphs.
Leveraging the highway structure to speed up the distance
computation has been shown to be quite successful in
road-network and planar graphs HT], fM, ED, ES. The recent
$k$-skip graph [33] work represents the latest effort in using
highway structure to reduce the search space of the
well-known shortest distance computation method, Reach [11].
Basically, each shortest path is succinctly represented by a
subset of vertices, namely a $k$-skip shortest path, such that it
contains at least one vertex out of every $k$ consecutive
vertices in the original shortest path. In other words, a $k$-skip
shortest path compactly describes the original shortest path by
sampling its vertices with a rate of at least $1/k$. Tao et al.
show that those sampled vertices can be utilized to speed up
the distance computation. In a similar spirit, the
gate-vertex set and gate graph directly highlight the long-range
connection between vertices, and can also serve as a highway
structure in a general graph.

We note that the $k$-skip cover and gate vertices are conceptually
close but different. The $k$-skip cover intends to uniformly
sample vertices in shortest paths, whereas the gate-vertex
set tries to recover shortest-path distance using intermediary
vertices (and local walks). More importantly, the $k$-skip cover
focuses on the road network and implicitly assumes there is
only one shortest path between any pair of vertices [33]. In this
work, we study the generalized graph topology, where there
may exist more than one shortest path between two vertices,
which is very common in graphs such as social networks.
Our goal is to recover the non-local distance between any pair
of vertices using only one shortest path. Finally, in this paper,
we focus on developing methods to discover a minimum
gate-vertex set, whereas [33] only targets a random set of vertices
which forms a $k$-skip cover, i.e., the minimum $k$-skip cover
problem is not addressed.

Landmarks: Landmark vertices (or simply landmarks) are a
subset of vertices in the graph which are selected and utilized
for graph navigation (particularly shortest-path distance
computation) H, mi, EB, HOl, 113, 113, m and transformation
(multidimensional scaling) [3]. Given a landmark set, each
vertex in the graph can approximate its network "position" by
its distances to each landmark. Thus, each vertex is directly
mapped to a multidimensional space where each landmark
corresponds to a unique dimension. In online shortest-path
distance computation, landmarks have been used together with
the triangle inequality for pruning the search space [10]; several
studies directly utilize landmarks for distance estimation [21], [27],
[23], El. However, landmarks are not necessarily
good representatives for highlighting the underlying topology
of the entire graph, while the gate vertices explicitly ensure
that any pair-wise distance can be recovered at a user-defined
granularity threshold.

Vertex Separators: A vertex separator [28] is a set of vertices
(denoted as $S$) in a graph $G$ which partitions the entire vertex
set $V$ into three sets, $A$, $S$ and $B$, where there are no edges
between vertices in $A$ and $B$. Using vertex separators, a graph
can be decomposed recursively. This is often used as a basis
for applying a divide-and-conquer approach to (hard) graph
problems. The gate vertices are different from separators as
they do not have to explicitly partition the graph. In particular,
if there are multiple non-local shortest paths between two
vertices, the gate vertices are guaranteed to recover at least
one of them. Thus, the gate vertices in some sense relax the
condition of vertex separators and allow us to recover the
shortest-path distance even on general graphs.

III. Properties of Gate-Vertex Set and Problem 
Transformation 

Based on the definition of the gate-vertex set, to verify that a
given set of vertices $V^*$ is a gate-vertex set, the naive approach
is to explicitly verify that the distance between every non-local
pair $(u, v)$ can be recovered through some sequence of
consecutive local pairs: $(u, v_1), (v_1, v_2), \ldots, (v_k, v)$, where all
intermediate vertices $v_1, v_2, \ldots, v_k \in V^*$. Clearly, this can
be expensive and difficult to apply directly to discover the
minimum number of gate vertices. In Subsection III-A we
first discuss an alternative (and much simplified) condition,
which enables the discovery of a gate-vertex set using only
local distances, and reveal the NP-hardness of the minimum
gate-vertex set discovery problem. In addition, we utilize
VC-dimension theory to bound the number of gate vertices.

A. Local Condition and Problem Reformulation 

In order to design a more efficient and feasible algorithm,
we explore the properties of gate vertices and observe that
a gate-vertex set can be efficiently checked by a very simple
condition. Let $G = (V, E)$ be an unweighted and undirected
graph. For any vertex $u \in V$, its $\epsilon$-neighbors, denoted as
$N_\epsilon(u)$, is the set of vertices whose distance to $u$ is no
greater than $\epsilon$, i.e., $N_\epsilon(u) = \{v \in V \mid 0 < d(u, v) \leq \epsilon\}$. Let $L$
be a set of vertices and $S = \{(u_0, v_0), \ldots, (u_k, v_k)\}$ be a set
of vertex pairs in the graph $G$. We say that $L$ covers $S$ if for
each vertex pair $(u_i, v_i) \in S$ there is at least one vertex $x \in L$
such that $d(u_i, v_i) = d(u_i, x) + d(x, v_i)$.

Now, we introduce the following key observation:
Lemma 1: (Sufficient Local Condition for MGS) If
for each vertex $x$ in the graph $G$, there is a subset of
vertices $L(x) \subseteq N_{\epsilon-1}(x)$ which covers all vertex pairs
$\{(x, y_i) \mid d(x, y_i) = \epsilon\}$, then $\bigcup_{x \in V} L(x)$ is a gate-vertex set
of graph $G$. In other words, a vertex set which covers any
pair of vertices with distance $\epsilon$ is a gate-vertex set.

Proof Sketch: For any non-local pair $s$ and $t$ ($d(s, t) \geq \epsilon$),
we denote one of their shortest paths by $P = (s = v_0, v_1, \ldots, v_k = t)$
with length $k = d(s, t) \geq \epsilon$. Let us
consider $v_\epsilon$ on the shortest path. Since $d(s, v_\epsilon) = \epsilon$, there is at
least one vertex $x_0 \in L(s)$ such that $d(s, x_0) + d(x_0, v_\epsilon) =
d(s, v_\epsilon)$. Now, we consider two cases:

1) If $d(x_0, t) < \epsilon$, then we recover a local-walk sequence
$(s, x_0, t)$.

2) If $d(x_0, t) \geq \epsilon$, since $d(s, x_0) + d(x_0, t) = d(s, t)$ (based on
the fact $d(x_0, v_\epsilon) + d(v_\epsilon, t) = d(x_0, t)$), we have $d(x_0, t) <
d(s, t)$. Then, we can recursively apply the above method to
identify $x_1$ between $x_0$ and $t$, $x_2$ between $x_1$ and $t$, until
$d(x_i, t) < \epsilon$.

Since $x_0, x_1, \ldots, x_i \in \bigcup_{x \in V} L(x)$ (i.e., they also belong to
$V^*$) and the distance of every vertex pair $(x_m, x_{m+1})$ is less
than $\epsilon$, we can recover the distance between $s$ and $t$ through
$(s, x_0, x_1, \ldots, x_i, t)$, where $d(s, x_0) < \epsilon$, $d(x_0, x_1) < \epsilon$, $\ldots$,
$d(x_i, t) < \epsilon$ and $d(s, x_0) + \sum_{m=0}^{i-1} d(x_m, x_{m+1}) + d(x_i, t) =
d(s, t)$. Therefore, $\bigcup_{x \in V} L(x)$, a set of vertices of $V$ which
can cover any vertex pair with distance $\epsilon$, is a gate-vertex set.
□

Interestingly, the local condition specified in Lemma 1 is
also a necessary one for a gate-vertex set.

Lemma 2: (Necessary Local Condition for MGS) Given
an unweighted and undirected graph $G$ and its gate-vertex set
$V^*$ with respect to parameter $\epsilon$, for any vertex $s \in V$ we have
$L(s) = \{x \in V^* \mid 0 < d(s, x) < \epsilon\}$ such that for any vertex $t$
with distance $\epsilon$ to $s$ (i.e., $d(s, t) = \epsilon$), there is $x \in L(s)$ with
$d(s, t) = d(s, x) + d(x, t)$.

Proof Sketch: For any non-local vertex pair $(s, t)$ with
$d(s, t) = \epsilon$, by the definition of a gate-vertex set, there must exist
a sequence of vertices $x_0, x_1, \ldots, x_i \in V^*$ such that $d(s, t) =
d(s, x_0) + d(x_0, x_1) + \cdots + d(x_i, t)$, where $d(s, x_0) < \epsilon$, $\ldots$,
$d(x_i, t) < \epsilon$. Since $d(s, x_0) < \epsilon$, we have $x_0 \in L(s)$. Also,
it is easy to see that $d(s, t) = d(s, x_0) + d(x_0, t)$ because
$d(s, t) - d(s, x_0) = d(x_0, x_1) + \cdots + d(x_i, t) \geq d(x_0, t)$
and $d(s, t) \leq d(s, x_0) + d(x_0, t)$. Therefore, we have at least
$x_0 \in L(s)$ satisfying $d(s, t) = d(s, x_0) + d(x_0, t)$. □

Putting this together, given parameter $\epsilon$, checking whether
a subset of vertices $V^* \subseteq V$ is a gate-vertex set is equivalent
to checking the following condition: for any vertex pair $(u, v)$
with distance $\epsilon$, there is a vertex $x \in V^*$ such that $d(u, v) =
d(u, x) + d(x, v)$. Similarly, we can rewrite the minimum
gate-vertex set discovery problem in the following equivalent local
condition (only covering vertex pairs with distance $\epsilon$).

Definition 3: (Minimum Gate-Vertex Set Problem using
Local Condition) Given an unweighted undirected graph $G =
(V, E)$ and a user-defined threshold $\epsilon$, we would like to seek a
set of vertices $V^*$ with minimum cardinality, such that any
pair of vertices $(u, v)$ with distance $\epsilon$ is covered by at least
one vertex $x \in V^*$: $d(u, x) + d(x, v) = d(u, v)$.
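The local condition only inspects vertex pairs at distance exactly equal to the locality parameter, which makes it easy to test. The sketch below is a minimal illustration: it follows the restriction in Lemma 1 that a covering vertex lies strictly inside the local range of the pair's first endpoint, and the function names are assumptions introduced here.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from `source` in an unweighted graph {vertex: [neighbors]}."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def satisfies_local_condition(adj, gates, eps):
    """Every pair (u, v) with d(u, v) == eps must have a gate x with
    0 < d(u, x) < eps and d(u, x) + d(x, v) == eps."""
    dist = {s: bfs_distances(adj, s) for s in adj}
    for u in adj:
        for v, d in dist[u].items():
            if d != eps:
                continue
            covered = any(
                0 < dist[u].get(x, -1) < eps
                and dist[u][x] + dist[x].get(v, eps) == eps
                for x in gates
            )
            if not covered:
                return False
    return True

# Illustrative graph: the path 0-1-...-6.
path7 = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 6] for i in range(7)}
print(satisfies_local_condition(path7, {2, 4}, eps=3))
print(satisfies_local_condition(path7, {3}, eps=3))
```

Compared with checking every non-local pair, this test only needs distances up to the locality parameter, which is the point of the reformulation.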

In the following, we prove the NP-hardness of the
aforementioned problem by a reduction from the 3SAT problem.

Theorem 1: (NP-hardness of MGS using Local Condition
provided Shortest Paths) Given a collection $P$ of
vertex pairs with $d(u, v) = \epsilon$ denoting a set of shortest
paths from an unweighted undirected graph $G = (V, E)$, finding
the minimum number of vertices $V^* \subseteq V$ such that any vertex
pair $(u, v)$ is covered by at least one vertex $x \in V^*$ is NP-hard.

Proof Sketch: We can reduce the 3SAT problem to this problem.
Let $S$ be an instance of 3SAT with $n$ variables $x_1, x_2, \ldots, x_n$
and $m$ clauses $C_1, C_2, \ldots, C_m$. We show that an instance of
our problem can be constructed correspondingly as follows. An
unweighted undirected graph $G$ consisting of a vertex $p$ and a
set of variable gadgets and clause gadgets will be generated.

The variable gadget with respect to variable $x$ contains 3
vertices and 2 edges:

1) 3 vertices: $b^x$, $b^{\bar{x}}$ and $e^x$;

2) 2 edges: $(b^x, e^x)$ and $(b^{\bar{x}}, e^x)$.

Also, we add edges $(p, b^x)$ and $(p, b^{\bar{x}})$ to build the
connections between $p$ and variable $x$'s gadget. For each clause
$C_i = (X, Y, Z)$, we add vertex $c_i$ to graph $G$ first. Then, if
$X = x$, we add edge $(b^x, c_i)$; otherwise, we add edge $(b^{\bar{x}}, c_i)$.
The same rule is applied to literals $Y$ and $Z$. That is, we add
3 edges $(b^X, c_i)$, $(b^Y, c_i)$ and $(b^Z, c_i)$ into $G$. The subgraph
containing vertices $\{p, b^X, b^Y, b^Z, c_i\}$ and the edges created above
is called the clause gadget regarding $C_i$.

Here we consider $\epsilon = 2$ in the graph $G$. That is, we try to
find a set of vertices $V^*$ with minimum cardinality to cover
a collection $P$ of shortest paths with length $\epsilon$ (i.e., 2 in this
scenario). Note that, in our problem, for a shortest path $SP =
(x, y, z)$ with length 2, only the vertex $y$ in the middle can be
used to cover $SP$, identified by its two endpoints $(x, z)$. Let
us define $P_{pe^{x_i}}$ to be the vertex pairs indicating shortest paths with
length 2 between $p$ and $e^{x_i}$. Moreover, $P_{pc_k}$ denotes the vertex
pairs representing shortest paths with length 2 between $p$ and
$c_k$. We consider the gate-vertex selection problem on the vertex pairs

$$P = \Big( \bigcup_i P_{pe^{x_i}} \Big) \cup \Big( \bigcup_k P_{pc_k} \Big).$$

In the following, we prove that the above 3SAT instance is
satisfiable if and only if the instance of our problem has a
solution of size at most $n$. We need to prove both the "only
if" and the "if" directions as follows.

$\Rightarrow$: Suppose the 3SAT instance $S$ is satisfiable and $f$ is its
corresponding satisfying assignment. For each variable $x$, if
$f(x) = 1$, vertex $b^x$ is added to $V^*$ to cover the shortest
paths within $P_{pc_i}$ where clause $C_i$ contains the literal $x$. Otherwise,
we add vertex $b^{\bar{x}}$ into $V^*$. As we can see, since either $b^x$ or $b^{\bar{x}}$ is
selected, the shortest paths within $P_{pe^x}$ can always be covered.
On the other hand, since one of the literals $X$ in each clause $C$
is guaranteed to be true, there is always one vertex indicating
such a literal that serves as the intermediate hop in the shortest paths of $P_{pc}$.
In this sense, only one vertex corresponding to the literal with
true value is added to $V^*$ in each variable gadget. Therefore,
the solution size for the instance of our problem is $n$.

Fig. 1: Example for proof of Theorem 1
$\Leftarrow$: Suppose graph $G$ has a gate-vertex set $V^*$ of size $n$
with respect to $\epsilon = 2$. For the 3SAT instance $S$, we define a truth
assignment by setting $f(x) = 1$ if and only if vertex $b^x$ is
included in $V^*$. We will show this is a satisfying assignment
without conflict. First, according to the definition of a
gate-vertex set, at least one vertex from $V^*$ covers each vertex
pair in $P_{pc_i}$, meaning that there is at least one literal with true
value in each clause. This leads to the true value
of the entire 3SAT instance. Furthermore, in order to cover every
vertex pair in $P_{pe^{x_i}}$, either vertex $b^{x_i}$ or $b^{\bar{x}_i}$ must be chosen.
Considering the constraint $|V^*| \leq n$, for each variable gadget,
only one of $b^{x_i}$ and $b^{\bar{x}_i}$ can be included in the gate-vertex set.
From the perspective of the 3SAT instance $S$, this guarantees
that only one of the literals $x_i$ and $\bar{x}_i$ is assigned the
true value. That is, no conflict occurs in our aforementioned
assignment. Putting both together, we can claim that $f$ is a
satisfying truth assignment. □
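The reduction can be exercised on tiny formulas. The following sketch rests on several assumptions: literals are encoded as signed integers, vertex names such as `("b", lit)` are illustrative, and both the minimum cover and satisfiability are found by exhaustive search, which is only feasible for very small inputs. It builds the graph described in the proof, collects the pairs $(p, e^{x_i})$ and $(p, c_k)$, and compares the minimum cover size with the number of variables.

```python
from itertools import combinations, product

def build_reduction(num_vars, clauses):
    """Graph of the proof: hub p, literal vertices ("b", +-i), ("e", i), ("c", k)."""
    adj = {}
    def add_edge(a, b):
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    for i in range(1, num_vars + 1):
        for lit in (i, -i):
            add_edge("p", ("b", lit))        # p to both literal vertices
            add_edge(("b", lit), ("e", i))   # variable gadget edges
    for k, clause in enumerate(clauses):
        for lit in clause:
            add_edge(("b", lit), ("c", k))   # clause gadget edges
    pairs = [("p", ("e", i)) for i in range(1, num_vars + 1)]
    pairs += [("p", ("c", k)) for k in range(len(clauses))]
    return adj, pairs

def min_cover_size(adj, pairs):
    """Smallest set of middle vertices hitting every distance-2 pair (brute force)."""
    middles = [adj[u] & adj[v] for u, v in pairs]  # common neighbors
    candidates = sorted(set().union(*middles))
    for size in range(len(candidates) + 1):
        for chosen in combinations(candidates, size):
            if all(m & set(chosen) for m in middles):
                return size
    return None

def is_satisfiable(num_vars, clauses):
    """Exhaustive truth-table check of the formula."""
    for values in product([False, True], repeat=num_vars):
        if all(any(values[abs(l) - 1] == (l > 0) for l in c) for c in clauses):
            return True
    return False

# Hypothetical formulas over 3 variables (not the example from the paper).
sat_clauses = [(1, 2, 3), (-1, 2, -3), (1, -2, 3)]
unsat_clauses = [tuple(s * v for s, v in zip(signs, (1, 2, 3)))
                 for signs in product((1, -1), repeat=3)]
for clauses in (sat_clauses, unsat_clauses):
    adj, pairs = build_reduction(3, clauses)
    print(is_satisfiable(3, clauses), min_cover_size(adj, pairs))
```

In both cases a cover needs at least one literal vertex per variable, so a cover of size exactly $n$ exists only when the chosen literals also hit every clause, which mirrors the two directions of the proof.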

Example: consider the following boolean formula S in 
3SAT problem with respect to variables x y and z: 

(x V y V z) A (x V y V z) A (x V y V z) A (a- V y V z) 

To simplify the discussion, we name the
above 4 clauses $C_1$, $C_2$, $C_3$ and $C_4$,
respectively. Here, we consider the vertex pairs
$P = \{(p, e^x), (p, e^y), (p, e^z), (p, c_1), (p, c_2), (p, c_3), (p, c_4)\}$
with distance $\epsilon$ (i.e., 2 in the example). The constructed graph
$G$, including variable gadgets and clause gadgets, is shown
in Figure 1. In this example, the formula is satisfied by the
assignment $x = 1$, $y = 1$ and $z = 0$. According to the rule
defined in the proof, we add $b^x$, $b^y$ and $b^{\bar{z}}$ to the gate-vertex set
$V^*$. It is not hard to verify that no vertex set with fewer vertices
than $|V^*|$ can be obtained. From another direction,
we can see that V* = {6^,6^,6^} C V{G) is minimum 
gate-vertex set with respect to vertex pairs in P (i.e., the 
problem is equivalent to find the set cover problem: ground 
set {ci, C2, C3, C4}, candidate sets {5'''- , 6^} {X G {x,x}, 
Y G {y,y} and Z G {z,z})). The corresponding satisfying 
assignment can be built as follows: x = 0, y = and z = 1. 
It is straightforward to verify that this is a truth assignment 
for instance S. 

B. Size of Minimum Gate-Vertex Set 

In the following, using the theory of VC-dimension, we 
derive an upper bound of the cardinality of minimum gate- 
vertex set. 

VC-dimension and $\epsilon$-net: We start with a brief introduction
of the VC-dimension of set systems and $\epsilon$-nets. The notion of
VC-dimension, originally introduced by Vladimir Vapnik and
Alexey Chervonenkis in [36], is widely used to measure the
expressive power of a set system. Let $U$ be a finite set and $R$
a collection of subsets of $U$; the pair $(U, R)$ is referred to as
a set system. A set $A \subseteq U$ is shatterable in $R$ if and only if
for any subset $S$ of $A$, there is always a subset $X \in R$ where
$X \cap A = S$. In other words, $X$ contains "exactly" $S$ with
no element of $A \setminus S$. Then, we say the VC-dimension of the set
system $(U, R)$ is the largest integer $d$ such that some subset of $U$
with size $d$ can be shattered (so no subset of size $d+1$ can be
shattered). In addition, given a parameter
$\epsilon \in [0, 1]$, a set $N \subseteq U$ is an $\epsilon$-net on $(U, R)$ if for any subset
$X \in R$ of size no less than $\epsilon|U|$, the set $N$ contains
at least one element of $X$. For a set system with bounded
VC-dimension $d$, the $\epsilon$-net theorem states that there exists an $\epsilon$-net
of size $O(\frac{d}{\epsilon} \log \frac{1}{\epsilon})$ [12].
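For intuition, shattering, VC-dimension and the $\epsilon$-net property can be computed by brute force on a tiny set system. The sketch below is illustrative only: it uses the vertex sets of all subpaths of a 4-vertex path as the set system (an assumption chosen for this example), and the function names are hypothetical.

```python
from itertools import combinations

def is_shattered(A, R):
    """A is shattered if the traces X & A over X in R give all subsets of A."""
    traces = {frozenset(X & A) for X in R}
    return len(traces) == 2 ** len(A)

def vc_dimension(U, R):
    """Size of the largest shattered subset of U (exhaustive search)."""
    d = 0
    for size in range(1, len(U) + 1):
        if any(is_shattered(frozenset(A), R) for A in combinations(U, size)):
            d = size
        else:
            break  # subsets of shattered sets are shattered, so we can stop
    return d

def is_eps_net(U, R, N, eps):
    """N hits every set of R whose size is at least eps * |U|."""
    return all(X & N for X in R if len(X) >= eps * len(U))

# Set system: vertex sets of all subpaths (intervals) of the path 0-1-2-3.
U = range(4)
R = [frozenset(range(i, j)) for i in range(4) for j in range(i + 1, 5)]
print(vc_dimension(U, R))
print(is_eps_net(U, R, {1, 2}, eps=0.5))
```

Two vertices of the path can be shattered by intervals, but three cannot, because no interval contains the outer two vertices while skipping the middle one.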

Using the VC-dimension and the $\epsilon$-net theorem, we can bound
the size of the minimum gate-vertex set.

Theorem 2: Given a graph $G = (V, E)$ with parameter
$\epsilon$, the size of the minimum gate-vertex set is bounded by
$O\big(\frac{|V|}{\epsilon - 1} \log \frac{|V|}{\epsilon - 1}\big)$.

To prove Theorem 2, we need a few lemmas. To facilitate
our discussion, we introduce the following notations. Given
the input graph $G = (V, E)$, let $p^*_{s,t}$ be the subpath of the
shortest path $p_{s,t}$ without the two endpoints. For instance,
if $p_{s,t} = (s, u, \ldots, v, t)$, its corresponding $p^*_{s,t} = (u, \ldots, v)$.
Further, let $P_l$ contain only shortest paths $p_{s,t}$ of length $l$, i.e.,
$P_l = \bigcup_{s,t} \{p_{s,t} \text{ s.t. } |p_{s,t}| = l\}$. Given $P_l$ with $l > 1$, we say $P_l^*$
is a core-set of $P_l$ if for each shortest path $p_{s,t}$, only its subpath
$p^*_{s,t}$ is included in $P_l^*$, i.e., $P_l^* = \bigcup_{s,t} \{p^*_{s,t} \text{ s.t. } p_{s,t} \in P_l\}$.

We first establish the relationship between the $\epsilon$-net and the
gate-vertex set.

Lemma 3: ($\frac{\epsilon - 1}{|V|}$-net) Given a set system $(V, P_\epsilon^*)$, where $P_\epsilon$
contains a shortest path for every vertex pair with distance $\epsilon$
in graph $G = (V, E)$ ($P_\epsilon^*$ is the core-set of $P_\epsilon$), a $\frac{\epsilon - 1}{|V|}$-net $V^*$
of $(V, P_\epsilon^*)$ is a gate-vertex set.

Proof Sketch: By the definition of $P_\epsilon^*$, the number of vertices in
a shortest path $p^*_{s,t} \in P_\epsilon^*$ is $\epsilon - 1$. According to the definition of a
$\frac{\epsilon - 1}{|V|}$-net, for each shortest path $p^*_{s,t} \in P_\epsilon^*$, we have $p^*_{s,t} \cap V^* \neq
\emptyset$. Moreover, recall that each shortest path of $P_\epsilon^*$ is a subpath of
some shortest path of $P_\epsilon$ obtained by removing the two endpoints. In other
words, if $V^*$ contains at least one vertex from each shortest
path in $P_\epsilon^*$, then at least one vertex from each shortest path in $P_\epsilon$ is
included in $V^*$. Since $P_\epsilon$ contains one shortest path for every
vertex pair with distance $\epsilon$, this satisfies the condition of a
gate-vertex set, namely that there is at least one vertex $x \in V^*$ with
$d(u, v) = d(u, x) + d(x, v)$ for every vertex pair $(u, v)$ with
$d(u, v) = \epsilon$. Therefore, $V^*$ is a gate-vertex set. □

To bound the size of the $\epsilon$-net, the VC-dimension of the set
system is needed. In [9], [33], the VC-dimension of a unique
shortest path system, i.e., only one shortest path exists between
any pair of vertices in a graph, is studied. Formally, we first
define the Unique Shortest Path System:

Definition 4: (Unique Shortest Path System (USPS))
Given a graph $G = (V, E)$ and a collection $Q$ of shortest
paths from $G$, we say $Q$ is a unique shortest path system if,
whenever a vertex pair $u$ and $v$ is contained in two shortest paths
$p_{s_1,t_1}, p_{s_2,t_2} \in Q$, they are linked by the same path, i.e.,
$p_{u,v} = p'_{u,v}$, where $p_{u,v}$ ($p'_{u,v}$) is the subpath of $p_{s_1,t_1}$ ($p_{s_2,t_2}$).

For any unique shortest path system $(V, P)$, it can be easily
verified that its VC-dimension is 2 [9], [33]. Thus, if a graph
contains only one unique shortest path system, then the bound
described in Theorem 2 can be directly derived (following
the $\epsilon$-net theorem [12]). However, in our problem, there can
be many different shortest paths between any given pair of
vertices. To deal with this problem, we make the following
observation:

Lemma 4: (Existence of USPS) Given any graph $G =
(V, E)$, there exists a unique shortest path system $P$ in $G$.
Proof Sketch: We prove this lemma by induction on the number of
edges of graph $G$. We first assume that when a graph $G$ has
$|E| = N$ edges, it has a unique shortest path system $P$. Now, we add
a new edge $e = (x, y)$ to $E$ (the new edge can introduce a new
vertex; the new graph is denoted as $G'$). Then, we first drop all
the $p_{u,v} \in P$ ($P$ is in $G$) such that $p_{u,v}$ is no longer a shortest path
between $u$ and $v$, i.e., $|p_{u,v}| > d(u, v \mid G')$. Clearly, the
remaining $P$ is still a unique shortest path system. Now, for
each dropped vertex pair $u$ and $v$, we must be able to construct
a new shortest path between $u$ and $v$ using edge $e = (x, y)$ as
follows: $p_{u,x} \cup (e = (x, y)) \cup p_{y,v}$, where $p_{u,x}$ and $p_{y,v}$ belong
to the remaining $P$. By adding those new shortest paths to $P$,
we claim $P$ is a unique shortest path system containing a
shortest path between any vertex pair in $G'$. This is because
for any vertex pair $s$ and $t$, their shortest path either
contains the new edge $e$ or does not. In both cases, their
shortest path is uniquely defined in the new path system. □

Basically, we can always extract a USPS from a general
graph even when there is more than one shortest path between
a pair of vertices. Combining Lemmas 3 and 4, we can now
prove Theorem 2.

Proof Sketch of Theorem 2: By Lemma 4, for any graph
$G = (V, E)$, we have unique shortest path systems $P_\epsilon$ and
$P_\epsilon^*$, because they are subsets of a general USPS $P$ with all
possible lengths. Now, for the set system $(V, P_\epsilon^*)$, we know
that: 1) its VC-dimension is at most 2 [9], [33]; 2) a $\frac{\epsilon - 1}{|V|}$-net
on this set system is a gate-vertex set by Lemma 3. Using the
$\epsilon$-net theorem, we have a gate-vertex set (i.e., a $\frac{\epsilon - 1}{|V|}$-net) of size
$O\big(\frac{|V|}{\epsilon - 1} \log \frac{|V|}{\epsilon - 1}\big)$. Moreover, the size of the minimum gate-vertex
set is no larger than that of any gate-vertex set. Putting both together,
the theorem follows. □

Lower Bound: The lower bound of the minimum gate-vertex
set can be arbitrarily small. For example, in Figure 2 the minimum
gate-vertex set is only the central vertex, and no gate vertex is
needed for any graph with diameter less than $\epsilon$. In this case,
even if a gate-vertex set of size $O\big(\frac{|V|}{\epsilon - 1} \log \frac{|V|}{\epsilon - 1}\big)$ is obtained, we
still cannot decide how good it is compared to the minimum
gate-vertex set.

Fig. 2: Gate-vertex set ($\epsilon = 3$)



IV. Algorithms for Gate-Vertex Set Discovery 

Based on Theorem 2 and the $\epsilon$-net theorem [12], we observe
that any random sample of size $O\big(\frac{|V|}{\epsilon - 1} \log \frac{|V|}{\epsilon - 1}\big)$ has a high
probability of forming a gate-vertex set, but this is not
guaranteed. An adaptive sampling method [33] is introduced
to guarantee finding a $k$-skip cover. The guarantee is achieved
by choosing a vertex using the information gained from
previously sampled vertices. Since a $k$-skip cover can serve
as a candidate for the gate-vertex set with $\epsilon = k + 1$ (as
stated in Lemma 5), we can utilize the adaptive sampling
method to discover a gate-vertex set. However, since the lower
bound of the minimum gate-vertex set can be arbitrarily small,
the approximation ratio between the size of the gate-vertex
set discovered by this method and the minimum one is not
bounded. In other words, this method does not necessarily
produce a tight gate-vertex set.

Lemma 5: Given graph G = (V, E), if parameters e of gate- 
vertex set and k of fc-skip cover satisfy condition fc = e — 1, 
fc-skip cover V* is a gate-vertex set. 

Proof Sketch: We prove it by contradiction. Assume that $V^*$ is
not a gate-vertex set; that is, there exists a vertex pair $(u, v)$
with distance $d(u, v) = \epsilon$ for which no vertex $x \in V^*$
(note that $x \neq u, v$) satisfies $d(u, v) = d(u, x) + d(x, v)$.
By the definition of $k$-skip cover, there is a shortest path
$p_{u,v}$ on which $V^*$ contains at least one vertex out of every
$k$ consecutive vertices. By our assumption, no interior vertex of
$p_{u,v}$ belongs to $V^*$, so only the starting point $u$ and the
ending point $v$ may be included in $V^*$. However, even if both $u$
and $v$ are selected, $V^*$ still does not contain any vertex from
the subpath $p_{u',v'}$ formed by the $k$ interior vertices (since
$p_{u,v}$ has $k + 2$ vertices). This is a contradiction. □
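
The condition used in the proof above can be checked directly on small graphs. The following is a minimal sketch, not the authors' implementation: the helper names `bfs_distances` and `is_gate_vertex_set` and the path-graph example are assumptions made for illustration, and the check covers only pairs at distance exactly $\epsilon$, as in the proof.

```python
from collections import deque
from itertools import combinations

def bfs_distances(adj, source):
    """Hop distances from source in an unweighted graph {vertex: neighbors}."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def is_gate_vertex_set(adj, gates, eps):
    """True if every pair at distance exactly eps has a gate x (x != u, v)
    with d(u, v) = d(u, x) + d(x, v)."""
    dist = {v: bfs_distances(adj, v) for v in adj}
    for u, v in combinations(adj, 2):
        if dist[u].get(v) != eps:
            continue
        has_gate = any(
            x != u and x != v
            and dist[u].get(x, eps + 1) + dist[x].get(v, eps + 1) == eps
            for x in gates
        )
        if not has_gate:
            return False
    return True

# Hypothetical example: the path 0-1-2-3-4-5-6 with eps = 3, i.e. k = 2.
path = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 6] for i in range(7)}
two_skip_cover = {1, 3, 5}   # hits every 2 consecutive path vertices
not_a_cover = {0, 3, 6}      # misses the consecutive vertices 1, 2
print(is_gate_vertex_set(path, two_skip_cover, 3))
print(is_gate_vertex_set(path, not_a_cover, 3))
```

On this path, the 2-skip cover {1, 3, 5} passes the check, in line with Lemma 5, while {0, 3, 6} fails because the pair (0, 3) has no gate strictly between its endpoints.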

Note that a gate-vertex set with locality parameter $\epsilon = k + 1$
may not be a $k$-skip cover. Also, as we mentioned earlier, the
$k$-skip cover focuses on a unique shortest path system, and
since there may exist more than one shortest path between two
vertices, the adaptive sampling method chooses one of such
paths arbitrarily.

We propose a set-cover-based algorithm with a guaranteed
logarithmic bound and compare it with the adaptive sampling
method.



A. Set-Cover Based Approach 

We propose an effective algorithm based on the set cover
framework to discover a gate-vertex set with a logarithmic bound.
Specifically, we transform the minimum gate-vertex set discovery
problem (MGS) into an instance of the set cover problem [3]:
Let $U = \{(u, v) \mid d(u, v) = \epsilon\}$ be the ground set, which
includes all the non-local pairs with distance equal to $\epsilon$.
Each vertex $x$ in the graph is associated with a set of vertex pairs
$C_x = \{(u, v) \mid d(u, v) = d(u, x) + d(x, v) = \epsilon\}$, which
includes all the non-local pairs with distance equal to $\epsilon$
that have a shortest path going through vertex $x$.
Given this, in order to discover the minimum gate-vertex set,
we seek a subset of vertices $V^* \subseteq V$ that covers the ground
set, i.e., $U = \bigcup_{x \in V^*} C_x$, with minimum cost $|V^*|$.
Basically, $V^*$ serves as the index of the candidate sets selected
to cover the ground set.

Theorem 3: The minimum solution $V^*$ for the above set-cover
instance is a minimum gate-vertex set of graph $G$ with
parameter $\epsilon$.

Its proof follows directly from Definition 3. The
minimum set cover problem is NP-hard, and we can apply the
classical greedy algorithm [3] to this problem: Let $R$ record
the covered pairs in $U$ (initially, $R = \emptyset$). For each
candidate set $C_x = \{(u, v) \mid d(u, v) = d(u, x) + d(x, v) = \epsilon\}$
discussed above, we define the price of $C_x$ as:

$$\gamma(C_x) = \frac{1}{|C_x \setminus R|}$$

At each iteration, the greedy algorithm picks the candidate
set $C_x$ with the minimum $\gamma(C_x)$ (the cheapest price) and puts
its corresponding vertex $x$ in $V^*$. Then, the algorithm
updates $R$ accordingly, $R = R \cup C_x$. The process continues
until $R$ completely covers the ground set $U$ ($R = U$), which
contains all non-local pairs with distance equal to $\epsilon$. It has
been proved that the approximation ratio of this algorithm is
$\ln(|U|) + 1$.
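
The greedy procedure described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the function name `greedy_gate_cover` and the toy instance are hypothetical, and the price is taken to be the classical unit-cost price $1/|C_x \setminus R|$, so minimizing the price is the same as maximizing the number of newly covered pairs.

```python
def greedy_gate_cover(universe, candidates):
    """Classical greedy set cover.

    universe:   set of pairs to cover (U in the text)
    candidates: dict mapping vertex x to its set of pairs (C_x in the text)
    Returns the chosen vertices (V* in the text) in selection order.
    """
    covered = set()   # R in the text
    chosen = []
    while not universe <= covered:
        # Cheapest price 1/|C_x \ R| means the largest number of new pairs.
        best = max(candidates, key=lambda x: len(candidates[x] - covered))
        gain = candidates[best] - covered
        if not gain:
            raise ValueError("some pairs are not covered by any candidate set")
        chosen.append(best)
        covered |= gain
    return chosen

# Hypothetical toy instance with three pairs and three candidate vertices.
U = {("u", "v1"), ("u", "v2"), ("a", "b")}
C = {
    "x1": {("u", "v1")},
    "x2": {("u", "v1"), ("u", "v2")},
    "x3": {("a", "b")},
}
print(greedy_gate_cover(U, C))
```

Here `x2` is picked first because it covers two uncovered pairs, and `x3` follows because `x1` no longer contributes anything new.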

Fast Transformation: In order to adopt the aforementioned
set-cover based algorithm to discover the gate-vertex set, we
first have to generate the ground set $U$ and each candidate
subset $C_x$ associated with vertex $x$. Though we only need the
non-local pairs with distance $\epsilon$, whose number is much smaller
than that of all non-local pairs, the straightforward approach still
needs to precompute the distance of each pair of vertices with
distance no greater than $\epsilon$, and then apply this information
to generate each candidate set. For large unweighted graphs,
the computational and memory cost can still be rather high.

Here, we introduce an efficient procedure which performs
a local BFS from each vertex, visiting only its $\epsilon$-neighborhood,
and collects during this process all the information needed for
constructing the set-cover instance ($U$ and $C_x$, for $x \in V$).
Specifically, each local BFS starting from a vertex $u$
has the following two tasks: 1) it needs to find all the vertices $v$
which are exactly distance $\epsilon$ away from $u$, and then add the
corresponding pairs $(u, v)$ into $U$; and 2) for each such pair
$(u, v)$ ($d(u, v) = \epsilon$), it needs to identify all the vertices
$x$ which can appear on a shortest path from $u$ to $v$. In order to
achieve these two tasks, we again utilize the basic recursive
property of the shortest-path distance: Let the distance from $u$ to
vertex $y$ be $d$, and let $z \neq u$ be a vertex whose distance to
$u$ is $d - 1$, for which we know all the intermediate vertices
appearing on at least one shortest path from $u$ to $z$ (denoted as
$I(z)$). Then, all the intermediate vertices on shortest paths from
$u$ to $y$ can be written as $\bigcup_{(z,y) \in E} (I(z) \cup \{z\})$,
where the union ranges over the neighbors $z$ of $y$ with
$d(u, z) = d - 1$. Based on this property, we can easily maintain
$I(x)$ for each vertex $x$ such that $1 < d(u, x) \leq \epsilon$; when
$d(u, x) = 1$, $I(x) = \emptyset$. Since BFS visits $u$'s
$\epsilon$-neighborhood in a level-wise fashion, when it reaches level
$\epsilon$, where each vertex $v$ is at distance $\epsilon$ from $u$,
we not only get each targeted pair $(u, v)$, but also get $I(v)$,
which we can easily use for producing the candidate sets: for each
$x \in I(v)$, we add $(u, v)$ to $C_x$.

(a) Example for Algorithm 1
(b) Example for Set Cover
Fig. 3: Example for Gate Discovery Algorithm ($\epsilon = 3$)

Algorithm 1 sketches the BFS-based algorithm for constructing
the set cover instance. In particular, the set $I(v)$ is
computed in Line 5, and when BFS reaches level $\epsilon$ (Line 13),
it adds $(u, v)$ to the ground set $U$ (Line 14) and to each $C_x$
($x \in I(v)$) (Line 15). The algorithm is invoked for each
vertex $u$ in the graph. Finally, Figure 3 illustrates a simple
running example of Algorithm 1 for vertex $u$ with $\epsilon = 3$.

Algorithm 1 BFSSetCoverConstruction($G = (V, E)$, $\epsilon$, $U$, $u$)

    $I(u) \leftarrow \emptyset$; $level(u) \leftarrow 0$; $Q \leftarrow \{u\}$ {queue for BFS};
    while $Q \neq \emptyset$ do
        $v \leftarrow Q.pop()$;
        if $level(v) \geq 2$ {$d(u, v) \geq 2$} then
            $I(v) \leftarrow \bigcup_{(x,v) \in E \wedge level(x)+1 = level(v)} (I(x) \cup \{x\})$;
        end if
        if $level(v) < \epsilon$ {$d(u, v) < \epsilon$} then
            for all $w \in Neighbor(v)$ {$(v, w) \in E$} do
                if $w$ is not visited then
                    $level(w) \leftarrow level(v) + 1$; $I(w) \leftarrow \emptyset$; $Q.push\_back(w)$;
                end if
            end for
        else
            $U \leftarrow U \cup \{(u, v)\}$;
            $\forall x \in I(v)$, $C_x \leftarrow C_x \cup \{(u, v)\}$;
        end if
    end while
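
The local BFS can also be written as runnable code. The sketch below is one possible reading of the procedure, not the authors' C++ implementation: the function names, the adjacency-dictionary input, and the choice to store each pair as an unordered `frozenset` (so that $(u, v)$ and $(v, u)$ coincide) are assumptions made for illustration.

```python
from collections import defaultdict, deque

def bfs_set_cover_construction(adj, eps, u, ground, cand):
    """Local BFS from u, limited to depth eps.

    Adds every pair {u, v} with d(u, v) == eps to `ground`, and to cand[x]
    for each intermediate vertex x on some shortest path from u to v.
    """
    level = {u: 0}
    inter = {u: set()}                  # I(.) in the text
    queue = deque([u])
    while queue:
        v = queue.popleft()
        if level[v] >= 2:
            # Every neighbor one level up was dequeued earlier,
            # so its I(.) set is already final.
            for x in adj[v]:
                if level.get(x) == level[v] - 1:
                    inter[v] |= inter[x] | {x}
        if level[v] < eps:
            for w in adj[v]:
                if w not in level:      # w has not been visited yet
                    level[w] = level[v] + 1
                    inter[w] = set()
                    queue.append(w)
        else:
            pair = frozenset((u, v))    # unordered pair
            ground.add(pair)
            for x in inter[v]:
                cand[x].add(pair)

def build_set_cover_instance(adj, eps):
    """Run the local BFS from every vertex and return (U, {x: C_x})."""
    ground, cand = set(), defaultdict(set)
    for u in adj:
        bfs_set_cover_construction(adj, eps, u, ground, cand)
    return ground, dict(cand)

# Hypothetical toy graph with edges 0-1, 0-2, 1-3, 2-3, 3-4.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
ground, cand = build_set_cover_instance(adj, 3)
print(ground, cand)
```

In this toy graph the only pair at distance 3 is {0, 4}, and vertices 1, 2 and 3 all lie on some shortest path between its endpoints, so each of them receives the pair in its candidate set.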



Computational Complexity: The overall set-cover based
mining algorithm for discovering a gate-vertex set includes two
key steps: 1) constructing the set-cover instance (Algorithm 1),
and 2) the greedy set-cover discovery algorithm. The first
step, which collects the ground set and each candidate set, takes
$O\left(\sum_{v \in V} (|N_\epsilon(v)|^2 + |E_\epsilon(v)|)\right)$
time, where $|N_\epsilon(v)|$ ($|E_\epsilon(v)|$) is the number of
vertices (edges) in $v$'s $\epsilon$-neighborhood.
For the greedy set cover procedure, by utilizing the speedup
queue technique [31], [16], we only need to visit $d \ll |V|$
vertices in the queue (i.e., all vertices are ranked in ascending
order in the queue), and each step takes $O(d(\log |V| + 1))$
time to extract and update the queue. As the greedy procedure
has $O(|V^*|)$ steps, it takes $O(d|V^*|(\log |V| + 1))$ time in total.
Putting these together, the overall time complexity of the algorithm
is $O\left(d|V^*| \log |V| + \sum_{v \in V} |N_\epsilon(v)|^2\right)$.
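
One common way to realize such a queue is lazy re-evaluation with a binary heap: stored gains may be stale, but since a candidate's gain can only shrink as more pairs are covered, a stale value is always an upper bound. Whether this matches the technique of [31], [16] in detail is an assumption; the sketch below, with the hypothetical name `lazy_greedy_cover`, only illustrates why few queue entries need to be touched per step.

```python
import heapq

def lazy_greedy_cover(universe, candidates):
    """Greedy set cover with a lazily updated priority queue.

    Heap entries are (-gain, x). Keys x must be mutually comparable
    because ties on the gain fall back to comparing x.
    """
    covered, chosen = set(), []
    heap = [(-len(c), x) for x, c in candidates.items() if c]
    heapq.heapify(heap)
    while heap and not universe <= covered:
        _, x = heapq.heappop(heap)
        gain = len(candidates[x] - covered)   # fresh gain of x
        if gain == 0:
            continue                          # x can never help again
        if heap and gain < -heap[0][0]:
            # Another entry may be better: reinsert x with its fresh gain.
            heapq.heappush(heap, (-gain, x))
            continue
        # The fresh gain of x is at least every other upper bound.
        chosen.append(x)
        covered |= candidates[x]
    if not universe <= covered:
        raise ValueError("some elements are not covered by any candidate set")
    return chosen

# Hypothetical instance in which one stale entry has to be re-evaluated.
universe = {1, 2, 3, 4, 5, 6}
candidates = {"a": {1, 2, 3, 4}, "b": {3, 4, 5}, "c": {5, 6}}
print(lazy_greedy_cover(universe, candidates))
```

After `a` is chosen, the stored gain of `b` (3) is stale; its fresh gain is 1, so it is pushed back and `c` is selected instead.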

V. Algorithm for Gate Graph Discovery 

In this section, we study the gate graph discovery problem
(Definition 2 in Subsection II-A). Basically, after a gate-vertex
set $V^*$ is discovered from graph $G$, we ask how to minimally
connect those gate vertices while still preserving the ability
to represent non-local distances through consecutive local
pairs. Specifically, the gate graph $G^* = (V^*, E^*, W)$ is a
weighted graph which guarantees that, for any non-local pair $u$
and $v$ in $G$ ($d(u, v) \geq \epsilon$),

$$d(u, v) = \min_{d(u,x) < \epsilon \wedge d(y,v) < \epsilon \wedge x, y \in V^*} d(u, x) + d(x, y|G^*) + d(y, v).$$

Here $d(x, y|G^*)$ is the distance between $x$ and $y$ in the gate
graph. Finding the overall sparsest gate graph $G^*$ seems to be
a hard problem. Here, we develop a two-stage algorithm that tries
to prune as many non-essential edges between gate vertices as
possible.
Stage 1: Constructing Local-Gate Graph. In the first stage,
we construct a local-gate graph $G'$ over the gate vertices
by connecting two gate vertices exactly when their distance is less
than $\epsilon$: $G' = (V^*, E', W)$, where
$E' = \{(u, v) \mid d(u, v) < \epsilon\} \subseteq V^* \times V^*$,
and $w(u, v) = d(u, v)$ for $(u, v) \in E'$. In the next
stage, we will try to sparsify the local-gate graph by removing
the non-essential edges, i.e., those edges whose removal will
not affect any shortest-path distance in the gate graph. Why
do we need only local-pair edges in the gate graph? Lemma 6
answers this question.

Lemma 6: The local-gate graph $G'$ can guarantee that, for
any non-local pair $u$ and $v$ in $G$ ($d(u, v) \geq \epsilon$),

$$d(u, v) = \min_{d(u,x) < \epsilon \wedge d(y,v) < \epsilon \wedge x, y \in V^*} d(u, x) + d(x, y|G') + d(y, v).$$

Lemma 6 can be derived directly from the definition of
gate-vertex set. Its proof is omitted for simplicity.
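
The construction of Stage 1 and the distance recovery stated in Lemma 6 can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names and the path-graph example are hypothetical, local distances are obtained by a depth-bounded BFS, and distances inside $G'$ are computed with a plain Dijkstra search seeded with all gates near $u$.

```python
from collections import deque
import heapq

def bounded_bfs(adj, source, max_depth):
    """Hop distances from source, restricted to distance <= max_depth."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        if dist[v] == max_depth:
            continue
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def build_local_gate_graph(adj, gates, eps):
    """Weighted adjacency {gate: {gate: d}} with an edge iff 0 < d < eps."""
    local = {g: {} for g in gates}
    for g in gates:
        for v, d in bounded_bfs(adj, g, eps - 1).items():
            if v in local and v != g:
                local[g][v] = d
    return local

def recover_distance(adj, local, eps, u, v):
    """min over gates x near u, y near v of d(u,x) + d(x,y|G') + d(y,v)."""
    near_u = {x: d for x, d in bounded_bfs(adj, u, eps - 1).items() if x in local}
    near_v = {y: d for y, d in bounded_bfs(adj, v, eps - 1).items() if y in local}
    best = dict(near_u)                       # multi-source Dijkstra on G'
    heap = [(d, x) for x, d in near_u.items()]
    heapq.heapify(heap)
    while heap:
        d, x = heapq.heappop(heap)
        if d > best[x]:
            continue                          # outdated heap entry
        for y, w in local[x].items():
            if d + w < best.get(y, float("inf")):
                best[y] = d + w
                heapq.heappush(heap, (d + w, y))
    return min((best[y] + dy for y, dy in near_v.items() if y in best),
               default=None)

# Hypothetical example: path 0-1-...-6 with gates {1, 3, 5} and eps = 3.
path = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 6] for i in range(7)}
local = build_local_gate_graph(path, {1, 3, 5}, 3)
print(local, recover_distance(path, local, 3, 0, 6))
```

For the non-local pair (0, 6), the recovered value is the walk 0 to gate 1, gate 1 to gate 5 inside $G'$, and gate 5 to 6, which equals the true distance on the path.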
Stage 2: Edge Sparsification for Local-Gate Graph. In this
stage, for each edge in the local-gate graph, we determine
whether removing it will change the distance between any
local pair (if the distances of local pairs are unchanged, so are
those of non-local pairs, by the definition of gate vertices). This
can be equivalently described by the following condition: for any
edge $(u, v)$ in the local-gate graph $G'$, if there is a vertex
$x$ ($x \neq u$, $x \neq v$) such that $d(u, x) + d(x, v) = d(u, v)$,
then edge $(u, v)$ is non-essential and can be safely removed from
$G'$. How do we test this condition? Using the local-gate graph,
this becomes very simple.

Lemma 7: Given the local-gate graph $G'$, let $N(u)$ be the
set of gate vertices adjacent to vertex $u$ in $G'$. For any edge
$(u, v)$ in $G'$, if there is a common vertex $x \in N(u) \cap N(v)$
such that $w(u, x) + w(x, v) = d(u, v)$, then removing edge $(u, v)$
from $G'$ will not affect the distance between any two vertices
in $G'$; if not, then edge $(u, v)$ is essential and removing it
increases the distance between at least one pair of vertices ($u$
and $v$).

Dataset      #V      #E       Dia.   Avg.Dist
CA-GrQc      5242    28980    17     6.1
CA-HepTh     9877    51971    18     6
Wiki-Vote    7115    103689   7      3.3
P2PG08       6301    20777    9      4.6
P2PG09       8114    26013    9      4.8
P2PG30       36682   88328    11     5.7
P2PG31       62586   147892   11     5.9

TABLE I: Real Datasets

This lemma essentially utilizes the property that, in the
local-gate graph, any pair with distance less than $\epsilon$ is linked
through an edge in $G'$, and thus we do not need to search for
arbitrary replacement paths: if an edge can be replaced at all,
there must be a replacing shortest path with only two edges.
Given this, we can see that the pruning algorithm needs to scan
the edge set of the local-gate graph twice: 1) it applies Lemma 7
to determine whether each edge can be removed and flags the
removable ones; and 2) it removes all the edges flagged as
non-essential. Note that we should not drop an edge immediately
after we find it to be non-essential, since it may still be needed
when testing other edges. Finally, the computational
complexity of the overall edge sparsification algorithm is
$O\left(\sum_{v \in V^*} (|N_{\epsilon-1}(v)| + |E_{\epsilon-1}(v)|) + |E'|\right)$,
considering the cost of computing the distance between local pairs
of the gate vertices.
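
The two-pass pruning based on Lemma 7 can be sketched as follows. This is a minimal sketch, assuming the local-gate graph is given as a symmetric weighted adjacency dictionary; the function name `sparsify_local_gate_graph` and the three-gate example are hypothetical.

```python
def sparsify_local_gate_graph(local):
    """Prune non-essential edges from a symmetric adjacency {u: {v: w}}.

    Pass 1 flags every edge (u, v) that has a common neighbor x with
    w(u, x) + w(x, v) == w(u, v). Pass 2 removes the flagged edges.
    Flagged edges stay in place during pass 1 because they may still
    be needed to test other edges.
    """
    flagged = set()
    for u, nbrs in local.items():
        for v, w_uv in nbrs.items():
            common = nbrs.keys() & local[v].keys()
            if any(local[u][x] + local[x][v] == w_uv for x in common):
                flagged.add(frozenset((u, v)))
    return {
        u: {v: w for v, w in nbrs.items() if frozenset((u, v)) not in flagged}
        for u, nbrs in local.items()
    }

# Hypothetical local-gate graph: gates 1, 3, 5 on a path, with eps = 5,
# so that the pair (1, 5) at distance 4 is also linked by an edge.
local = {1: {3: 2, 5: 4}, 3: {1: 2, 5: 2}, 5: {1: 4, 3: 2}}
print(sparsify_local_gate_graph(local))
```

The edge (1, 5) of weight 4 is dropped because the two-edge route through gate 3 has the same total weight, while the two edges of weight 2 are kept.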

VI. Experimental Evaluation 

In this section, we empirically study the performance of our
approaches on both real and synthetic datasets. Specifically,
we compare two methods in the experiments: 1) FS, which
corresponds to the approach utilizing adaptive sampling [33]
for gate-vertex discovery; 2) SC, which corresponds to the
approach using the set cover framework for gate-vertex discovery
(Subsection IV-A). Here, we are interested in understanding
how many vertices can be removed by using the gate-vertex set,
how many edges are needed in the gate graph, and how both
are affected by the locality parameter $\epsilon$. In each experiment,
we measure the number of gate vertices, the number of
edges in the gate graph, and the running time of the algorithms.
To gain a better understanding of the experimental results, we
also report two important graph measures for each graph: the
diameter (referred to as Diam.) and the average pairwise
shortest-path distance (referred to as Avg.Dist). We implemented our
algorithms in C++ with the Standard Template Library (STL). All
experiments were conducted on a 2.8GHz Intel Xeon CPU with
12.0GB RAM running Linux 2.6.

A. Real Data 

In this subsection, we use the 7 real-world datasets listed in
Table I to validate the performance of our approaches. Among
them, CA-GrQc and CA-HepTh are collaboration networks
from arXiv describing scientific collaboration relationships
between authors in the General Relativity and Quantum Cosmology
field and in the High Energy Physics field, respectively. Moreover,
P2PG08, P2PG09, P2PG30 and P2PG31 are 4 snapshots of the
Gnutella peer-to-peer file sharing network collected in August
and September 2002. Wiki-Vote describes the
relationships between users and their related discussion from
the inception of Wikipedia until January 2008. All datasets
are publicly available at the Stanford Large Network Dataset
Collection¹.

Table II reports the size of the gate-vertex set and the number
of edges in the gate graphs when varying the locality parameter
$\epsilon$ from 3 to 6. The corresponding shortest distance
distributions and vertex degree distributions are shown in Figure 4
and Figure 5, respectively. Since the distances and vertex degrees
of P2PG30 and P2PG31 have distributions similar to those of P2PG08
and P2PG09, and their large values would affect the visualization
of the other datasets' distributions, we omit them in both figures.
We make the following observations:

Size of Gate-Vertex Set: Table II shows that the gate-vertex sets
discovered by both FS and SC are consistently smaller than the
vertex sets of the original graphs. Among them, SC always
obtains the better results, which are on average approximately
76%, 65%, 63% and 56% of those from FS with $\epsilon$ ranging
from 3 to 6. The size of the gate-vertex set discovered by
SC is on average around 26%, 21%, 27% and 24% of the number of
vertices in the corresponding original graph when $\epsilon$ varies
from 3 to 6. We also observe that, as the locality parameter
$\epsilon$ increases, the number of gate vertices discovered by SC is
gradually reduced. In particular, the reduction ratios on CA-GrQc,
CA-HepTh and Wiki-Vote are consistently better than those on
P2PG08, P2PG09, P2PG30 and P2PG31. In Figure 5, CA-GrQc, CA-HepTh
and Wiki-Vote seem to fit the power-law degree distribution very
well, while a significant portion of the vertices in P2PG08, P2PG09,
P2PG30 and P2PG31 have degree ranging from 10 to 15. In other
words, there exists a small portion of vertices with high degree
potentially serving as the intermediate connectors for traffic
between a large portion of vertex pairs in CA-GrQc, CA-HepTh
and Wiki-Vote. With SC's gate-vertex discovery method
using the set cover framework, those vertices can be selected as
gate vertices and thus dramatically simplify the original graphs.
However, for the file-sharing networks, a relatively large number of
vertices with high connectivity potentially leads to a larger
gate-vertex set under the same selection principle. From the
perspective of application domains, the results of SC on the three
social networks (i.e., CA-GrQc, CA-HepTh and Wiki-Vote)
suggest a small highway structure capturing major non-local
communications in the network. Interestingly, the consistent
decreasing trend in the size of the gate-vertex set with
increasing $\epsilon$ is not observed in the results of FS on P2PG30
and P2PG31. Since the adaptive sampling approach follows the spirit
of a greedy algorithm, choosing each gate vertex based only on
local information, the mis-selection of gate vertices at earlier
stages probably leads to a significant increase of gate vertices at
later stages. In other words, some important vertices selected
as gate vertices in the procedure with small $\epsilon$ might be missed
in the procedure with larger $\epsilon$. Therefore, it is reasonable to
observe that the number of gate vertices discovered by FS
unexpectedly becomes larger when $\epsilon$ increases.

¹ http://snap.stanford.edu/data

Edge Size of Gate Graph: The number of edges in the original
graphs is significantly reduced by SC on the three datasets
CA-GrQc, CA-HepTh and Wiki-Vote. On average, the gate graphs
generated by SC have 6.5, 6, 6.3 and 6 times fewer edges than the
original graphs for $\epsilon$ = 3, 4, 5 and 6, respectively. Moreover,
SC still outperforms FS on those datasets: the number of edges in
the gate graphs from SC is on average about 49%, 51%, 53% and 48%
of that in the gate graphs from FS for $\epsilon$ from 3 to 6.
Interestingly, as $\epsilon$ becomes larger, the number of edges in the
gate graphs generated by SC increases on CA-GrQc and CA-HepTh. The
reason is that, in order to guarantee that shortest paths between
all non-local vertex pairs can be recovered using fewer gate
vertices, more edges are needed to build stronger connections among
gate vertices. However, the number of edges in the gate graphs
generated by FS on CA-GrQc and CA-HepTh becomes smaller
when $\epsilon$ increases. This demonstrates the effectiveness of the
edge sparsification algorithm for pruning redundant edges, since
some of the gate vertices discovered by FS are non-essential and
need not be connected to their $\epsilon$-neighbors. For the other
four datasets (P2PG08, P2PG09, P2PG30 and P2PG31), the gate graphs
generated by FS contain fewer edges than those of SC. Overall,
they are on average about 1.3, 1.1, 1.4 and 1.8 times smaller than
those of SC when varying $\epsilon$ from 3 to 6. Also, as $\epsilon$
increases, the number of edges in the gate graphs generated by both
approaches on those four datasets increases. This is consistent with
our earlier discussion that in these graphs the interactions seem to
be more random, and the relatively large number of vertices with
degrees between 10 and 15 may increase their chance to connect to
other vertices with local walks.

Running Time: We take $\epsilon = 3$ as an example. The running
times of FS for the 7 datasets are 65ms, 132ms, 3s, 127ms,
158ms, 447ms and 811ms. The running times of SC are 23s,
53s, 1166s, 183s, 293s, 279s and 661s for CA-GrQc, CA-HepTh,
Wiki-Vote, P2PG08, P2PG09, P2PG30 and P2PG31, respectively.
As the locality parameter $\epsilon$ increases, the computational
cost of both approaches becomes larger, because more vertex
pairs must be considered in SC and more vertices are
traversed in FS. The average running time of SC with $\epsilon = 5$
can reach a few hours, which is around 100 times slower than
that of FS. Indeed, the selection between FS and SC is a trade-off
between reduction ratio and efficiency. In general, we can
see that with rather small $\epsilon$ (2 or 3), the vertex reduction
by SC is quite significant and also much better than
that of FS, and its running time is reasonable in practice.
In contrast to FS, the size of the gate-vertex set discovered by
SC is guaranteed to satisfy a logarithmic approximation bound.
Therefore, we would say that SC with smaller $\epsilon$ is applicable
in most applications.



B. Synthetic Data 

In the following, we study the two approaches on scale-free
and Erdős-Rényi random graphs.

Scale-Free Random Graph: In this experiment, we generated
a set of scale-free random graphs whose vertex degrees
follow a power-law distribution using a publicly available graph
generator². The number of vertices in those graphs is 10K,
and their edge density (i.e., $|E|/|V|$) ranges from 2 to 6. The
diameters of those graphs are 10, 8, 7, 6 and 6, and their average
pairwise distances are 6.4, 5.1, 4.5, 4.2 and 3.9.

We can see from Table III that, when the locality parameter $\epsilon$
increases, the size of the gate-vertex set discovered by both
approaches consistently decreases for graphs with different
edge density. Similar to the observation on the real-world
datasets, SC always achieves better results than FS in terms
of the number of gate vertices. Overall, the size of the gate-vertex
set discovered by SC is on average around 93%, 90%, 69%,
40% and 28% of that of FS with $\epsilon$ from 3 to 7. In
addition, when the locality parameter $\epsilon$ is less than Avg.Dist,
more gate vertices are discovered by both FS and SC as the edge
density increases. For denser graphs, since the graph diameter
becomes smaller, many more vertex pairs with distance $\epsilon$ need
to be covered compared to sparse graphs (see Figure 6). Therefore,
more gate vertices are required to serve as intermediate hops for
any vertex pair. When $\epsilon$ is greater than Avg.Dist, the number
of gate vertices discovered by SC is dramatically reduced since
fewer vertex pairs need to be processed in the set cover framework
(e.g., sf5_10K and sf6_10K with $\epsilon \geq 6$). However, this
phenomenon is not observed in the results of FS, i.e., its
gate-vertex sets shrink very slowly. Even when $\epsilon$ is greater
than the diameter, in which case no gate vertex is actually needed,
FS still discovers many gate vertices (e.g., sf5_10K and sf6_10K
with $\epsilon = 7$). In terms of the number of edges in the gate
graphs, FS performs slightly better than SC. When the locality
parameter $\epsilon$ is less than Avg.Dist, the results of both FS
and SC consistently increase, following the opposite trend of the
number of gate vertices. As $\epsilon$ increases, fewer gate vertices
are discovered and more internal connections within the gate graph
must be built to guarantee that there is a shortest path between
any pair of gate vertices.

In terms of running time, when the edge density increases, the
running times of both approaches consistently increase.
Taking $\epsilon = 3$ as an example, the running times of SC for
scale-free graphs with density from 2 to 6 are 6s, 49s, 158s, 531s
and 1067s, respectively. The running times of FS for those graphs
are 85ms, 170ms, 232ms, 344ms and 471ms, respectively.
As $\epsilon$ becomes larger, longer running times are expected
because more vertices are visited in both SC (i.e., procedure
BFSSetCoverConstruction) and FS (i.e., breadth-first search).
When $\epsilon = 6$, the running time of SC is on average around
30 times longer than that with $\epsilon = 3$, ranging from 430s to
6617s. Also, the running time of FS with $\epsilon = 6$ increases
significantly, varying from 18s to 1966s. As $\epsilon$ increases, the
efficiency advantage of FS over SC is dramatically reduced.
This is caused by the explosive increase in the running time
of FS's edge sparsification, since the number of edges in the
local-gate graph of SC is much smaller than that of FS when
$\epsilon \geq 6$.

² http://pywebgraph.sourceforge.net

Dataset      $\epsilon$   #V (FS)   #V (SC)   #E (FS)   #E (SC)
CA-GrQc      3            2836      869       9266      2655
CA-GrQc      4            1625      655       6848      2933
CA-GrQc      5            1116      567       5580      2984
CA-GrQc      6            908       500       5192      2858
CA-HepTh     3            5131      2208      15831     7674
CA-HepTh     4            3381      1669      14921     10241
CA-HepTh     5            2525      1364      14316     11249
CA-HepTh     6            2134      1157      14476     11456
Wiki-Vote    3            2564      1598      84607     59132
Wiki-Vote    4            2457      879       85051     34736
Wiki-Vote    5            2236      584       83681     22652
Wiki-Vote    6            2964      571       193913

TABLE III: Sizes of Simplified Graph on Scale-free Graphs

Fig. 9: Edge Size in Gate Graph (Large Rand Graph)

Erdős-Rényi Random Graph: In this experiment, we generate
a set of random graphs based on the Erdős-Rényi model, with
the edge density from 2 to 6, while keeping the number of
vertices at 10K. The diameters of those random graphs are 14,
10, 8, 7 and 6, respectively. Also, their corresponding average
pairwise distances are 6.8, 5.3, 4.7, 4.3 and 4.0, respectively.
Their shortest distance distribution is presented in Figure 7.
By varying $\epsilon$ from 3 to 7, Table IV shows the number of
vertices and the number of edges in the simplified graphs with
respect to original graphs with different edge density. The
observations for both approaches SC and FS on scale-free
graphs still hold on Erdős-Rényi random graphs. Overall,
the sizes of the gate-vertex sets discovered by SC are on average
approximately 88%, 86%, 66%, 41% and 30% of those from
FS with $\epsilon$ from 3 to 7. In terms of the number of edges in the
gate graphs, FS achieves slightly better results than SC. When
$\epsilon$ is no less than 4, the number of edges in the gate graphs
with different edge density by SC is on average 1.3, 2, 1.8 and
1.6 times greater than that of FS, respectively. Given the same
$\epsilon$ and edge density, the gate-vertex sets discovered by both
approaches on scale-free graphs are slightly smaller than
those on Erdős-Rényi graphs. This is true for the number of
edges in the gate graphs as well, when $\epsilon$ is no less than 3.

In general, both approaches run faster on Erdős-Rényi graphs
than on scale-free graphs. In particular, since for relatively
large $\epsilon$ the gate-vertex set discovered by SC is significantly
smaller than that of FS, we also observe that FS takes longer than
SC on those datasets with large values of $\epsilon$.

Large Random Graph: Finally, we perform this experiment
on a set of Erdős-Rényi random graphs and scale-free random
graphs with average edge density of 2, and we vary the number
of vertices from 100K to 500K. The locality parameter $\epsilon$ is
set to 4. The number of vertices and the number of
edges in the gate graphs are shown in Figure 8 and Figure 9,
respectively. The diameters of the 5 Erdős-Rényi graphs are 18,
18, 19, 19 and 21, and their average pairwise distances
are 8.4, 8.9, 9.2, 9.4 and 9.6. For the 5 scale-free graphs, the
diameters are 12, 13, 13, 14 and 15, and the average pairwise
distances are 7.8, 8.2, 8.4, 8.6 and 8.7.

As we can see, the size of the gate-vertex set discovered by SC
is consistently smaller than that of FS in both types of graphs,
while FS outperforms SC in terms of the number of edges in the
gate graphs. Moreover, as the number of vertices in the original
graphs increases, the size of the gate-vertex set discovered by SC
grows more slowly than that of FS. For both FS and SC, the number
of gate vertices discovered from scale-free graphs is smaller
than that from Erdős-Rényi graphs.

Overall, we observe that the reduction ratio of gate vertices
on the real-world graphs is significantly better than that on the
synthetic graphs. This suggests that the underlying structure of
real-world graphs is not that "random". In other words,
real graphs seem to have a more recognizable "highway"
structure in terms of shortest path connections. From this
perspective, existing random graph generators
have not been able to model this network behavior.

VII. Conclusion 

In this paper, we study a new graph simplification problem
that provides a high-level topological view of the original
graph while preserving distances. Specifically, we develop an
efficient algorithm utilizing the recursive nature of shortest paths
and the set cover framework to discover a gate-vertex set. More
interestingly, our theoretical results and algorithmic solution
can be naturally applied to the minimum $k$-skip cover problem,
which is still an open problem. In the future, we would like to
study whether approximate distances with guaranteed accuracy
can be obtained based on our framework. We also want to
investigate how our simplified graph can be applied to graph
clustering, multidimensional scaling and graph visualization.